Create Preference Dataset for DPO Fine-Tuning
Leveraging LLMs for creating preference dataset for DPO Fine-Tuning
In the realm of Supervised Fine-Tuning (SFT) for custom LLM, Direct Preference Optimization (DPO) is a technique used to align AI-generated outputs with human preferences by optimizing language models.
To achieve this, a preference dataset is required, containing data that enables models to understand which responses are preferred by humans and which are not.
In this article, we’ll walk through an example to create such a dataset using Python, OpenAI’s API, and Hugging Face’s Datasets library.
Let’s dive in.
Components of a Preference Dataset for DPO
A preference dataset typically includes:
- Prompts: Inputs or questions given to the AI model.
- Chosen Responses: Responses preferred by human evaluators.
- Rejected Responses: Less preferred responses or responses not selected by human evaluators.
By providing this structure, the dataset allows a model to learn which responses are preferable, making it better aligned with human preferences.
Our use-case
In our previous post, we created an instruction dataset, TinyStories_Instruction, from the raw TinyStories dataset. This dataset was specifically designed for fine-tuning a pretrained Large/Small Language Model to develop a story generator tailored to 5-year-olds.
In this guide, we take the next step by creating a preference dataset from the previously generated instruction dataset. This dataset is used for fine-tuning a pretrained Large/Small Language Model through Direct Preference Optimization (DPO), enhancing our story generator to align even better with human preferences and produce engaging, age-appropriate content for young children.
The process for creating a preference dataset is illustrated below:
Implementation
The implementation involves a series of steps: extracting data, generating AI responses, and creating preference triplets.
We will first import the required packages.
import concurrent.futures
import json
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple
from datasets import Dataset, load_dataset, concatenate_datasets
from openai import OpenAI
from tqdm.auto import tqdm
from google.colab import userdata1. Data Extraction Function
The extract_ground_instruction_story function extracts pairs of instructions and desired outputs from a given dataset.
def extract_ground_instruction_story(dataset):
return [(example['instruction'], example['output']) for example in dataset]2. Creating a PreferenceSet Class
The PreferenceSet class manages and stores the triples of (instruction, generated story, desired story).
class PreferenceSet:
def __init__(self, triples: List[Tuple[str, str, str]]):
self.triples = triples
@classmethod
def from_json(cls, json_str: str, instruction, desired_story) -> 'PreferenceSet':
data = json.loads(json_str)
triples = [(instruction, data['generated_story'], desired_story)]
return cls(triples)
def __iter__(self):
return iter(self.triples)3. Generating Preference-Response Triplets
The function generate_preference_answer_triples generates a story using OpenAI’s API and returns a preference triple in the format (instruction, generated response, desired response).
def generate_preference_answer_triples(instruction: str, desired_story: str, client: OpenAI) -> List[Tuple[str, str, str]]:
prompt = f"""Based on the following instruction, generate a story. \
Story should be no longer than 50 words. Story uses several complex words or structures \
that are not suitable for 5-year-olds.
Provide your response in JSON format with the following structure:
{{"generated_story": "..."}}
Instruction:
{instruction}
"""
completion = client.chat.completions.create(model="gpt-4o-mini",
messages=[
{"role": "system",
"content": "You are a helpful assistant who \
generates story based on the given instruction. \
Provide your response in JSON format.",},
{"role": "user", "content": prompt},
],
response_format={"type": "json_object"},
max_tokens=512,
temperature=0.2,)
result = PreferenceSet.from_json(completion.choices[0].message.content, instruction, desired_story)
# Convert to list of tuples
return result.triples4. Creating the Preference Dataset
The function create_preference_dataset creates a dataset using the extracted stories and generated responses.
def create_preference_dataset(dataset: Dataset, client: OpenAI, num_workers: int = 4) -> Dataset:
stories = extract_ground_instruction_story(dataset)
instruction_answer_triples = []
with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
futures = [executor.submit(generate_instruction_answer_triples, instruction, desired_story, client) for instruction, desired_story in stories]
for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
instruction_answer_triples.extend(future.result())
instructions, rejected_story, chosen_story = zip(*instruction_answer_triples)
return Dataset.from_dict({
"prompt": list(instructions),
"rejected": list(rejected_story),
"chosen": list(chosen_story)
})5. The main function
The main function initializes the OpenAI client, loads the dataset, creates a preference dataset, and uploads it to the Hugging Face Hub.
def main() -> Dataset:
client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
# 1. Load the raw data
# Load the train and test splits
train_dataset = load_dataset("tanquangduong/TinyStories_Instruction", split="train")
test_dataset = load_dataset("tanquangduong/TinyStories_Instruction", split="test")
# Combine the datasets
raw_dataset = concatenate_datasets([train_dataset, test_dataset])
print("Raw dataset:")
print(raw_dataset.to_pandas())
# 2. Create preference dataset
preference_dataset = create_preference_dataset(raw_dataset, client)
print("Preference dataset:")
print(preference_dataset.to_pandas())
# 3. Train/test split and export
filtered_dataset = preference_dataset.train_test_split(test_size=0.1)
filtered_dataset.push_to_hub("tanquangduong/TinyStories_Preference")6. Running the pipeline
Finally, we authenticate with Hugging Face for later dataset uploading and start running the pipeline.
from huggingface_hub import login
# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))
# Launch the pipeline to create instruction dataset
main()Resulting Preference Dataset
After running the above pipeline, the resulting preference dataset will look like this:
Conclusion
The article outlines the process of creating a Preference Dataset for fine tuning with DPO to align AI-generated outputs with human preferences.
The dataset consists of prompts, human-preferred responses, and rejected responses, allowing the model to learn desired behavior. Key steps include extracting instruction-output pairs, generating AI responses using OpenAI’s API, and organizing the data into preference triplets.
The final dataset is then uploaded to the Hugging Face Hub for later use.
We will use this preference dataset for fine-tuning with DPO in our upcoming post.